Compiler Supported Interval Optimisation for Communication Induced Checkpointing
Authors
Abstract
There are three main approaches to checkpoint-based recovery in distributed systems: coordinated checkpointing, uncoordinated checkpointing and communication-induced checkpointing. It can be shown that communication-induced checkpointing has the lowest theoretical minimum overhead, but also that its effective overhead depends on the communication behaviour and the forced checkpoints that result from it. If the placement of checkpoints and the communication pattern combine unfavourably, the overhead can grow arbitrarily large because of a high number of forced checkpoints. We introduce a compiler-supported approach that avoids unfavourable combinations of communication behaviour and local checkpoint placement. We analyse the application statically and prepare the placement of voluntary checkpoints; these placement decisions are then reviewed at runtime. With this approach we optimise the effective checkpoint intervals of voluntary and forced checkpoints and thus reduce the overhead of communication-induced checkpointing.
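As a rough illustration of how forced checkpoints arise, the sketch below uses a simple index-based rule: each process piggybacks its checkpoint index on outgoing messages, and a receiver takes a forced checkpoint whenever the incoming index is higher than its own. This rule is an assumption chosen for illustration and is not necessarily the protocol used in the paper; the class `Process` and its fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """One process in an index-based scheme (illustrative assumption only)."""
    name: str
    index: int = 0       # sequence number of the latest local checkpoint
    voluntary: int = 0   # checkpoints taken on the local schedule
    forced: int = 0      # checkpoints induced by incoming messages

    def take_voluntary_checkpoint(self) -> None:
        self.index += 1
        self.voluntary += 1

    def send(self) -> int:
        # The current checkpoint index is piggybacked on every message.
        return self.index

    def receive(self, piggybacked_index: int) -> None:
        # A message sent from a later checkpoint interval forces a
        # checkpoint before delivery, so the receiver catches up.
        if piggybacked_index > self.index:
            self.index = piggybacked_index
            self.forced += 1


p, q = Process("p"), Process("q")
for _ in range(3):
    p.take_voluntary_checkpoint()  # p checkpoints just before each send
    q.receive(p.send())            # so every message forces a checkpoint at q

print(q.forced)  # prints 3
```

In this unfavourable pattern every message induces a forced checkpoint at the receiver. If `p` instead placed its voluntary checkpoints after its sends, fewer messages would carry a higher index, which is the kind of placement effect the abstract refers to.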
Similar resources
Compiler-assisted Full Checkpointing
This paper describes a compiler-based approach to checkpointing for process recovery. The implementation is transparent to both the programmer and the hardware. The compiler-generated sparse potential checkpoint code maintains the desired checkpoint interval. Adaptive checkpointing reduces the size of the checkpoints. Training is used to select low-cost, high-coverage potential checkpoints. The...
Protocol for Coordinated Checkpointing using Smart Interval with Dual Coordinator
Introduction to Distributed System Design, Google Code University, http://code.google.com/edu/parallel/dsd-tutorial.html#Basics; D. Manivannan, R. H. B. Netzer & M. Singhal, "Finding Consistent Global Checkpoints in a Distributed Computation", IEEE Trans. on Parallel & Distributed Systems, Vol. 8, No. 6, pp. 623-627 (June 1997); J. Tsai & S. Kuo, "Theoretical Analysis for Commun...
Type-Safe Object Exchange Between Applications and a DSM Kernel
The Plurix project implements an object-oriented Operating System (OS) for PC clusters. Communication is achieved via shared objects in a Distributed Shared Memory (DSM) using restartable transactions and an optimistic synchronization scheme to guarantee memory consistency. We contend that coupling object orientation with the DSM property allows a type-consistent system bootstrapping, quick sys...
Adjoints for Time-Dependent Optimal Control
The use of discrete adjoints in the context of a hard time-dependent optimal control problem is considered. Gradients required for the steepest descent method are computed by code that is generated automatically by the differentiation-enabled NAGWare Fortran compiler. Single time steps are taped using an overloading approach. The entire evolution is reversed based on an efficient checkpointing ...
A Performability Model for Applications using Checkpointing
An analytical model is used to investigate the effects of checkpointing on the performance and availability of sequential and parallel applications. Known as Steady-State Performability (SSP), this model provides a probabilistic method for quantifying delivered performance considering failure and recovery. Input parameters describe both the distributed application and the processing environment...
Publication date: 2007